Abstract




Motivated by vision tasks such as robust face and object recognition, we consider the following
general problem: given a collection of low-dimensional linear subspaces in a high-dimensional
ambient (image) space and a query point (image), efficiently determine the nearest subspace to the
query in $\ell^1$ distance. We show in theory that Cauchy random embedding of the objects into
significantly lower-dimensional spaces helps preserve the identity of the nearest subspace with
constant probability. This offers the possibility of efficiently selecting several candidates for
accurate search. We sketch preliminary experiments on robust face and digit recognition to
corroborate our theory.

1 Introduction

Big data often come with prominent dimensionality and volume. Statistical learning and inference
may be hard at such scales, due not only to conceivable computational burdens, but also to stability
issues. Nevertheless, the parsimony that almost invariably dominates data-generating processes
dictates that structures associated with big data may be significant. For example, all images of a
Lambertian convex object of fixed pose undergoing (remote) illumination changes lie approximately on
a very low ($\approx 9$)-dimensional linear subspace, in the image space of dimensionality equal to
the number of pixels per image (normally $\sim 10000$) [4]. Examples that have low-dimensional
manifold structures abound (see, e.g., [17, 9]). By assuming reasonable structures for data, we can
then cast the inference problem as an instance of nearest structure search, of which nearest
subspace search may serve as a basic building block:


Problem 1 (Nearest Subspace Search) Given n linear subspaces $S_1, \dots, S_n$ of dimension r in
$\mathbb{R}^D$ and a query point $q \in \mathbb{R}^D$, determine the nearest $S_i$ to q.

Exploiting structures in big data has greatly helped in providing attractively simple formulations for 
learning and inference, and the remaining tasks are to make concrete the measure of "nearness" and 
to design efficient algorithms to solve the search problem.

Measure of Nearness. Typically, one adopts a metric $d(\cdot, \cdot)$ on $\mathbb{R}^D$, and then
sets $d(q, S_i) = \min_{v \in S_i} d(q, v)$. Certainly the appropriate choice of metric d depends on
our prior knowledge. For example, if the observation q is known to be perturbed by i.i.d. Gaussian
noise from its originating subspace, minimizing the $\ell^2$ norm $d(q, v) = \|q - v\|_2$ yields a
maximum likelihood estimator. However, in practice other norms may be more appropriate: particularly
in situations where the data may have sparse but significant errors, the $\ell^1$ norm is a more
robust alternative [7, 23]. For images, such errors are due to factors such as occlusions, shadows,
and specularities. We focus on the $\ell^1$ norm here, and our main problem is:



*This work is published in the 12th European Conference on Computer Vision (ECCV 2012) [19]. A full
version with technical details is available online at http://arxiv.org/abs/1208.0432 [20].
Problem 2 (Main, Nearest Subspace Search in $\ell^1$) Given n linear subspaces $S_1, \dots, S_n$ of
dimension r in $\mathbb{R}^D$ and a query point $q \in \mathbb{R}^D$, determine the nearest $S_i$ to
q in $\ell^1$ distance.

Efficiency of Algorithms. We would like to solve Problem 2 using computational resources that
depend as gracefully as possible on the ambient dimension D and the number of models n, both of
which could be very large for big data. The straightforward solution proceeds by solving a sequence
of n $\ell^1$ regression problems

$d_{\ell^1}(q, S_i) = \min_{v \in S_i} \|q - v\|_1.$    (1)

The total cost is $O(n \cdot T_{\ell^1}(D, r))$, where $T_{\ell^1}(D, r)$ is the time required to
solve (1). The best known complexity guarantees for solving (1), based on scalable first-order
methods [11, 5, 25, 24], are superlinear in D, though linear runtimes may be achievable when the
residual $q - v_\star$ is very sparse [10] or the problem is otherwise well-structured [1]. So even
in the best case, the straightforward solution has complexity $\Omega(nD)$. When both terms are
large, this dependence is prohibitive: although Problem 2 is simple to state and easy to solve in
polynomial time, achieving real-time performance or scaling to massive databases appears to require
a more careful study.
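To make the straightforward solution concrete, the following is a minimal sketch of solving (1) for every subspace, assuming each subspace is given by a D x r basis matrix. It uses a generic linear-programming formulation via SciPy rather than the first-order methods cited above; the function names `l1_distance` and `nearest_subspace_l1` and the toy sizes are illustrative choices, not part of the original work.

```python
import numpy as np
from scipy.optimize import linprog


def l1_distance(q, B):
    """l1 distance from q to span(B), i.e. min_c ||q - B c||_1, solved as an LP."""
    D, r = B.shape
    # Variables: c (r free coefficients) and t (D nonnegative residual bounds).
    cost = np.concatenate([np.zeros(r), np.ones(D)])
    # Encode |q - B c| <= t as  B c - t <= q  and  -B c - t <= -q.
    A_ub = np.block([[B, -np.eye(D)], [-B, -np.eye(D)]])
    b_ub = np.concatenate([q, -q])
    bounds = [(None, None)] * r + [(0, None)] * D
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.fun


def nearest_subspace_l1(q, bases):
    """Brute force: one l1 regression per subspace, as in (1)."""
    dists = np.array([l1_distance(q, B) for B in bases])
    return int(np.argmin(dists)), dists


rng = np.random.default_rng(0)
D, r, n = 60, 3, 5                         # toy sizes, far below real image dimensions
bases = [rng.standard_normal((D, r)) for _ in range(n)]
q = bases[2] @ rng.standard_normal(r)      # a point on subspace 2 ...
q[[4, 17, 40]] += 1.0                      # ... plus a sparse corruption
idx, dists = nearest_subspace_l1(q, bases)
print(idx, np.round(dists, 2))
```

Each call solves an LP with D + r variables and 2D constraints, so the total work grows with both n and D, which is the dependence the text identifies as prohibitive.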

Dealing with D by Cauchy Random Embedding. We present a very simple, practical approach to
Problem 2 with much improved dependency on D.¹ Rather than working directly in the high-dimensional
space $\mathbb{R}^D$, we randomly embed the query q and subspaces $S_i$ into $\mathbb{R}^d$, with
$d \ll D$. The random embedding is given by a $d \times D$ matrix P whose entries are i.i.d.
standard Cauchy random variables. That is to say, instead of solving (1), we solve

$d_{\ell^1}(Pq, PS_i) = \min_{v \in S_i} \|Pq - Pv\|_1.$    (2)

We prove that if the embedded dimension d is sufficiently large, say $d = \mathrm{poly}(r \log n)$,
then with constant probability the model $S_i$ obtained from (2) is the same as the one obtained
from the original optimization (1). The required dimension d does not depend in any way on the
ambient dimension D, and is often significantly smaller: e.g., d = 25 vs. D = 32,000 for one typical
example of face recognition. The resulting (small) $\ell^1$ regression problems are amenable to
customized interior point solvers (e.g., [16]). The price paid for this improved complexity is a
small increase in the probability of failure to locate the nearest subspace. Our theory quantifies
how large d needs to be to keep this probability of error under control. Repeated trials with
independent projections P can then be used to make the probability of failure as small as desired.

2 Cauchy Random Embedding and Theoretical Analysis 

Our entire algorithm relies on the standard Cauchy distribution with p.d.f.

$p_C(x) = \frac{1}{\pi (1 + x^2)},$    (3)

which is 1-stable [21] and heavy-tailed (shown in Figure 1). The core of our algorithm is
summarized as follows.




Input: n subspaces $S_1, \dots, S_n$ of dimension r and query q
Output: Identity of the closest subspace $S_\star$ to q in $\ell^1$ distance

Preprocessing: Generate $P \in \mathbb{R}^{d \times D}$ with i.i.d. Cauchy RVs ($d \ll D$) and
compute the projections $PS_1, \dots, PS_n$
Test: Compute the projection Pq, and compute its $\ell^1$ distance to each of $PS_i$



Figure 1: Standard Cauchy 
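The preprocessing and test steps above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: subspaces are represented by D x r basis matrices, the small $\ell^1$ regressions are solved with a generic LP solver instead of a customized interior point method, and the names `preprocess` and `query` as well as the value of d are hypothetical choices made here.

```python
import numpy as np
from scipy.optimize import linprog


def l1_distance(x, B):
    """min_c ||x - B c||_1 via an LP (same formulation in any dimension)."""
    m, r = B.shape
    cost = np.concatenate([np.zeros(r), np.ones(m)])
    A_ub = np.block([[B, -np.eye(m)], [-B, -np.eye(m)]])
    b_ub = np.concatenate([x, -x])
    bounds = [(None, None)] * r + [(0, None)] * m
    return linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs").fun


def preprocess(bases, d, rng):
    """Draw P with i.i.d. standard Cauchy entries and project every basis."""
    D = bases[0].shape[0]
    P = rng.standard_cauchy(size=(d, D))
    return P, [P @ B for B in bases]       # P S_i is spanned by the columns of P B_i


def query(q, P, projected_bases):
    """Project the query and return the index of the nearest projected subspace."""
    Pq = P @ q
    dists = [l1_distance(Pq, PB) for PB in projected_bases]
    return int(np.argmin(dists))


rng = np.random.default_rng(0)
D, r, n, d = 2000, 3, 10, 30               # toy sizes; d is chosen by hand here
bases = [rng.standard_normal((D, r)) for _ in range(n)]
q = bases[7] @ rng.standard_normal(r)
q[rng.choice(D, size=20, replace=False)] += 1.0   # sparse corruption
P, projected = preprocess(bases, d, rng)
print("selected subspace:", query(q, P, projected))
```

After preprocessing, each query costs one d x D matrix-vector product plus n regressions in dimension d, so D enters only through the projection of q.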

Our main theoretical result states that if d is chosen appropriately, then with at least constant
probability the selected subspace $\hat{S}$ will be the original closest subspace $S_\star$:



¹ The reason that we do not deal with n concurrently is discussed in Sec. 3.






Theorem 3 Suppose we are given n linear subspaces $\{S_1, \dots, S_n\}$ of dimension r in
$\mathbb{R}^D$ and any query point q, and that the $\ell^1$ distances of q to each of
$\{S_1, \dots, S_n\}$ are $\xi_{1'} \le \dots \le \xi_{n'}$ when arranged in ascending order, with
$\xi_{2'} / \xi_{1'} \ge \eta > 1$. For any fixed $a < 1 - 1/\eta$, there exists
$d \sim O\left[(r \log n)^{1/a}\right]$ (assuming $n > r$) such that, if
$P \in \mathbb{R}^{d \times D}$ is i.i.d. Cauchy, we have

$\arg\min_{i \in [n]} d_{\ell^1}(Pq, PS_i) = \arg\min_{i \in [n]} d_{\ell^1}(q, S_i)$    (4)

with (nonzero) constant probability.

At least two things are interesting about Theorem 3: 1) d depends on the relative gap $\eta$, the
ratio between the distances to the second closest and to the closest subspaces. Notice that
$\eta \in [1, \infty)$, and that the exponent $1/a$ becomes large as $\eta$ approaches one. This
suggests that our dimensionality reduction will be most effective when the relative gap is
nonnegligible; 2) d depends on the number of models n only through its logarithm. This rather weak
dependence is a strong point, and, interestingly, mirrors the Johnson-Lindenstrauss lemma for
dimensionality reduction in $\ell^2$, even though JL-style embeddings are impossible for $\ell^1$.
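To see how strongly the gap affects the bound, the short sketch below evaluates the quantity $(r \log n)^{1/a}$ for several values of $\eta$. It sets the constant hidden in $O(\cdot)$ to 1 and uses the natural logarithm; both are assumptions made purely for illustration, so the outputs indicate scaling only and are not recommended values of d. The helper names are hypothetical, and r = 9, n = 38 are the values from the face experiment in Section 4.

```python
import math


def gap_exponent_bound(eta):
    """Supremum of admissible a in Theorem 3: a must satisfy a < 1 - 1/eta."""
    if eta <= 1:
        raise ValueError("Theorem 3 assumes a relative gap eta > 1")
    return 1.0 - 1.0 / eta


def dimension_scale(r, n, a):
    """(r log n)^(1/a) with the O(.) constant set to 1 (an assumption)."""
    return (r * math.log(n)) ** (1.0 / a)


r, n = 9, 38                                # values used in the face experiment
for eta in (1.5, 2.0, 5.0, 50.0):
    a = 0.99 * gap_exponent_bound(eta)      # stay strictly below the bound
    print(f"eta={eta:5.1f}  a={a:.3f}  scale={dimension_scale(r, n, a):.3g}")
```

As $\eta$ grows, a can be taken close to 1 and the scale approaches $r \log n$; as $\eta$ approaches 1, the exponent $1/a$ blows up.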

Additional practical implications of Theorem 3 are in order: 1) Theorem 3 only guarantees success
with constant probability. This probability is easily amplified by taking a small number of
independent trials. Each of these trials generates one or more candidate subspaces $\hat{S}$. We can
then perform $\ell^1$ regression in $\mathbb{R}^D$ to determine which of these candidates is
actually nearest to the query; 2) Since the gap $\eta$ is one important factor controlling the
resources demanded, if we have reason to believe that $\eta$ will be especially small, we may
instead set d according to the gap between $\xi_{1'}$ and $\xi_{k'}$ for some $k' > 2$. With this
choice, Theorem 3 implies that with constant probability the desired subspace is amongst the
$k' - 1$ nearest (saved for further examination) to the query. If $k' \ll n$, this is still a
significant saving over the naive approach.
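The amplification in point 1) can be made explicit with a short calculation. Here p (a lower bound on the per-trial success probability guaranteed by Theorem 3), T (the number of independent trials), and $\delta$ (the target failure probability) are notation introduced only for this illustration.

```latex
% Each trial uses an independent Cauchy matrix, so the trials fail independently.
\Pr\bigl[\text{all } T \text{ trials miss } S_\star\bigr] \le (1 - p)^T ,
\qquad\text{hence}\qquad
T \ge \frac{\log(1/\delta)}{\log\bigl(1/(1 - p)\bigr)}
\;\Longrightarrow\;
\Pr\bigl[S_\star \text{ is among the candidates}\bigr] \ge 1 - \delta .
```

Because the final comparison among candidates is an exact $\ell^1$ regression in $\mathbb{R}^D$, the correct subspace is returned whenever it appears among the candidates. As a purely illustrative number, if p were 1/2, then T = 7 trials would suffice for $\delta = 0.01$, since $(1/2)^7 \approx 0.0078$.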



Idea of the Analysis. We present the full technical details in the report [20], and highlight the
intuition behind the analysis here. Figure 2 shows a histogram of the random variable
$\psi = d_{\ell^1}(Pq, PS)$ (with $d_{\ell^1}(q, S)$ normalized), over randomly generated Cauchy
matrices P, for two different configurations of query q and subspace S. Two properties are
especially noteworthy. First, the upper tail of the distribution can be quite heavy: with
nonnegligible probability, $\psi$ may significantly exceed its median. In contrast, the lower tail
is much better behaved: with very high probability, $\psi$ is not significantly smaller than its
median. This inhomogeneous behavior (in particular, the heavy upper tail) precludes very tight
distance-preserving embeddings using the Cauchy. However, our goal is not to find a
(near-isometric) embedding of the data, per se, but rather to find the nearest subspace to the
query. In fact, it suffices to show that with nontrivial constant probability

• P does not increase the distance from q to $S_\star$ too much; and,

• P does not shrink the distance from q to any of the other subspaces $S_i$ too much.

The observed inhomogeneous behavior is much less of an obstacle to establishing the desired results.

Figure 2: An illustration of how random Cauchy embedding changes the query-to-subspace $\ell^1$
distance statistically.
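The asymmetric tail behavior described above can be reproduced numerically in a simplified setting. The sketch below is not the experiment behind Figure 2: it assumes the degenerate case where the subspace is the origin, so the embedded distance is just $\|Px\|_1$, and it normalizes by d, which is a choice made here. By 1-stability, each entry of Px is a standard Cauchy variable scaled by $\|x\|_1$.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d, trials = 50, 25, 2000                 # toy sizes chosen for speed

x = rng.standard_normal(D)
x /= np.abs(x).sum()                        # normalize so that ||x||_1 = 1

# One independent d x D Cauchy matrix per trial; Px has shape (trials, d).
P = rng.standard_cauchy(size=(trials, d, D))
Px = P @ x

# 1-stability: every entry of Px is standard Cauchy scaled by ||x||_1 = 1,
# so the median of |entry| should be close to 1.
entry_median = np.median(np.abs(Px))

# psi: embedded l1 norm, averaged over the d rows, one value per trial.
psi = np.abs(Px).mean(axis=1)
q01, q50, q99 = np.quantile(psi, [0.01, 0.50, 0.99])

print(f"median |entry| = {entry_median:.3f}")
print(f"psi quantiles: 1% = {q01:.2f}, 50% = {q50:.2f}, 99% = {q99:.2f}")
print(f"upper spread q99/q50 = {q99 / q50:.1f}, lower spread q50/q01 = {q50 / q01:.1f}")
```

The upper spread should come out much larger than the lower spread, which matches the qualitative picture: the embedding can inflate a distance considerably but rarely shrinks it by much.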




3 Related Work 

Problem 2 is an example of a subspace search problem. In $\ell^2$, for r = 0 and r = 1, efficient
algorithms with sublinear query complexity in n exist for the approximate versions [8, 2]. For
r > 1, recent attempts [3, 13] offered promising numerical examples, but not sublinear complexity
guarantees. Results in theoretical computer science suggest that these limitations may be intrinsic
to the problem [22].
These attempts exploit special properties of the $\ell^2$ version of Problem 2, and do not apply to
its $\ell^1$ variant. However, the $\ell^1$ variant retains the aforementioned difficulties,
suggesting that an algorithm for $\ell^1$ near subspace search with sublinear dependence on n is
unlikely as well.² This motivates us to focus on ameliorating the dependence on D. Our approach is
very simple and very natural: Cauchy projections are chosen because the Cauchy is the unique
1-stable distribution, a property which has been widely exploited in previous algorithmic work
[8, 14, 18].

However, on a technical level, it is not obvious that Cauchy embedding should succeed for
this problem. The Cauchy is a heavy-tailed distribution, and because of this it does not yield
embeddings that very tightly preserve distances between points, as in the Johnson-Lindenstrauss
lemma. In fact, for $\ell^1$, there exist lower bounds showing that certain point sets in $\ell^1$
cannot be embedded in significantly lower-dimensional spaces without incurring non-negligible
distortion [6]. For a single subspace, embedding results exist, most notably due to Sohler and
Woodruff [18], but the distortion incurred is so large as to render them inapplicable to Problem 2.

4 Experimental Verification 

Again we highlight part of our experiments here; more details can be found in the report [20].
We take the Extended Yale B (EYB) face dataset [12] (n = 38, D = 168 × 192 ≈ 30000) and treat
the facial images of one person as lying on a 9-dimensional linear subspace (as argued in [4] and
practiced in [23]). For each subject, we take half of the images for training (1205 in total) and
the others for testing (1209 in total). To better illustrate the behavior of our algorithm, we
strategically divided the test set into two subsets: moderately illuminated (909, Subset M) and
extremely illuminated (300, Subset E).



Figure 3 presents the typical evolution of the recognition rate on Subset M as the projection
dimension (d) grows, with only one repetition of the projection. The high-dimensional nearest
subspace search (HDS) in $\ell^1$ achieves perfect (100%) recognition, and the recognition rate
(also the probability of success as in Theorem 3) stays stable above 95% with d > 25. Supposing the
distance gap is significant such that $1/a \to 1$, our theorem predicts
$d = r \log n = 9 \times \log 38 \approx 33$ (natural logarithm).





Figure 3: Recognition rate versus projection dimension (d) with one repetition on Subset M face
images of EYB.

Figure 4: Samples of moderately/extremely illuminated face images and their $\ell^1$ distances to
other subject subspaces (horizontal axis: ordered subject index).

For extremely illuminated face images, the $\ell^1$ distance gap between the first and second
nearest subspaces is much less significant (one example shown in Figure 4). Our theory suggests d
should be increased to compensate for the weak gap (because the exponent $1/a$ becomes
significant). Our experimental results in Table 1 confirm this prediction.

Table 1: Recognition rate on Subset E of EYB with varying d and N_back (# candidates for further test).

Setting                  HDS      d = 25   d = 50   d = 70
r = 15, N_back = 5       94.7%    79.3%    87.7%    92.3%
r = 15, N_back = 10      94.7%    87.3%    92.0%    94.0%


² Although it could be possible if we are willing to accept time and space complexity exponential
in r or D, à la [15].